Abstract

Probabilistic top-$k$ queries, which rank tuples by the interplay of score and probability under
the possible worlds semantics, have become an important research issue because they consider
both score and uncertainty on the same basis. In the literature, many different probabilistic
top-$k$ queries have been proposed. Almost all of them need to compute the probability of a tuple
$t_i$ to be ranked at the $j$-th position across the entire set of possible worlds. The cost of
this computation is the dominant cost and is known to be $O(kn^2)$, where $n$ is the size of the
dataset. In this paper, we propose a novel algorithm that computes such probabilities in $O(kn)$.

1 Introduction

Ranking is an important issue in uncertain data, and has attracted a lot of attention recently.
Probabilistic top-$k$ queries based on the interplay of score and probability, under the
possible worlds semantics, were first studied in [14]. In this paper, we show that we can
significantly improve the performance of all the probabilistic top-$k$ queries in the literature
[16, 17, 6, 7, 8, 18, 3, 11] under the x-Relation model. We achieve this by proposing a novel
algorithm that reduces the dominant cost of computing probabilistic top-$k$ queries to $O(kn)$,
where the previously known cost is $O(kn^2)$ and $n$ is the size of the dataset.

2 x-Relation Model and Probabilistic top-k semantics

In the x-Relation model [1, 17], an x-Relation contains a set of independent x-tuples (called
generation rules in [14, 7]). An x-tuple consists of a set of mutually exclusive tuples (also
called alternatives) that represent a discrete probability distribution over the possible tuples
the x-tuple may take in a randomly instantiated data. In an x-tuple, each alternative $t$ has a
score $score(t)$, and a probability $p(t)$ that represents its existence probability over possible
instances. In the x-Relation model, the alternatives of x-tuples are assumed to be disjoint. In
the following, we denote an x-Relation as $\mathcal{X}$, an x-tuple as $\tau$, and call an
alternative a tuple, denoted as $t$.

Example 2.1: Fig. 1(a) shows an x-Relation which consists of three x-tuples, $\tau_1 = \{t_1, t_3\}$,
$\tau_2 = \{t_2\}$, and $\tau_3 = \{t_4\}$. The x-tuple $\tau_1$ indicates a probability distribution
over $t_1$ and $t_3$: with probability $p(t_1) = 0.3$ its true content is $t_1$, with probability
$p(t_3) = 0.5$ its true content is $t_3$, and with probability $1 - p(t_1) - p(t_3) = 0.2$ neither
$t_1$ nor $t_3$ is the true content. □

In general, an x-Relation, $\mathcal{X}$, is a probability distribution over a set of possible
instances $\{I_1, I_2, \cdots\}$. A possible instance, $I_j$, maintains zero or one alternative for
every x-tuple $\tau \in \mathcal{X}$. The probability of an instance $I_j$, $\Pr(I_j)$, is the
probability that x-tuples take certain or none alternatives in $I_j$, such that
$\Pr(I_j) = \prod_{t \in I_j} p(t) \times \prod_{\tau \notin I_j} (1 - \Pr(\tau))$, where
$\tau \notin I_j$ means x-tuple $\tau$ takes no alternative in $I_j$ and
$\Pr(\tau) = \sum_{t \in \tau} p(t)$. The entire set of possible worlds of an x-Relation,
$\mathcal{X}$, denoted as $pwd(\mathcal{X})$, is the set of all the subsets $I_j$
($\subseteq \mathcal{X}$) with probability greater than 0 ($\Pr(I_j) > 0$).
Figure 1: x-Relation Data. Panel (b): Possible Worlds.

Example 2.2: Fig. 1(b) shows all 6 possible worlds for the x-Relation in Fig. 1(a). The
possible world $\{t_1, t_2\}$ means that $\tau_1$ takes the alternative $t_1$, $\tau_2$ takes the
alternative $t_2$, and $\tau_3$ takes none. The probability of this possible world is
$p(t_1)\,p(t_2)\,(1 - p(t_4)) = 0.06$. Note that the sum of the probabilities of all the possible
worlds is equal to 1. □
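
The enumeration in Example 2.2 can be sketched in a few lines of Python. This is an illustrative
sketch only, not part of the original algorithm. The values $p(t_1) = 0.3$ and $p(t_3) = 0.5$
come from Example 2.1; $p(t_2) = 1.0$ and $p(t_4) = 0.8$ are not stated in the text and are
inferred here from the world probabilities quoted in Examples 2.2 and 2.3. The function name
`possible_worlds` is hypothetical.

```python
from itertools import product

# Alternatives of each x-tuple as (tuple name, probability).
# p(t2) = 1.0 and p(t4) = 0.8 are inferred values, not given in the text.
X_TUPLES = [
    [("t1", 0.3), ("t3", 0.5)],  # tau_1
    [("t2", 1.0)],               # tau_2
    [("t4", 0.8)],               # tau_3
]

def possible_worlds(x_tuples):
    """Return {frozenset of tuple names: probability} for worlds with Pr > 0."""
    choices = []
    for alts in x_tuples:
        none_prob = 1.0 - sum(p for _, p in alts)
        # Each x-tuple takes one alternative, or none with the leftover mass.
        choices.append(list(alts) + [(None, none_prob)])
    worlds = {}
    for combo in product(*choices):
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob > 1e-12:  # keep only worlds with non-zero probability
            world = frozenset(name for name, _ in combo if name is not None)
            worlds[world] = worlds.get(world, 0.0) + prob
    return worlds

worlds = possible_worlds(X_TUPLES)
for world, prob in sorted(worlds.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(world), round(prob, 4))
```

Since the number of worlds grows exponentially with the number of x-tuples, such an enumeration
is only useful for checking small examples.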

Probabilistic top-k semantics: Several probabilistic top-$k$ semantics have been proposed
recently under the x-Relation model, including Uncertain Top-k Query (U-Topk) [14, 17],
Uncertain k-Ranks Query (U-kRanks) [14, 17], Global-Topk [18], Probabilistic Threshold top-k
query (PT-k) [7], and the Probabilistic k top-k query (Pk-topk) [8]. The PT-k and Pk-topk are
similar to the Global-Topk. The Global-Topk query finds $k$ tuples with the highest top-$k$
probability. PT-k finds all the tuples that have top-$k$ probability above a user-given threshold.
Pk-topk finds $k$ tuples with the highest top-$k$ probability in a data stream environment, where
every tuple is independent. All the above existing solutions except U-Topk need to compute the
probability of a tuple, $t_i$, to be ranked at the $j$-th position across the entire set of
possible worlds, denoted $p_{i,j}$.

Below, we introduce U-kRanks and Global-Topk with the emphasis on how $p_{i,j}$ is used. Let
$p_{i,j}$ be the probability of a tuple $t_i$ to be ranked at the $j$-th position across the entire
set of possible worlds [14, 17].

$$p_{i,j} = \sum_{I \in pwd(\mathcal{X}),\ t_i = \psi_j(I)} \Pr(I) \qquad (1)$$

where $\psi_j(I)$ denotes the tuple with the $j$-th largest score in an instance $I$ of the possible
worlds. The answer to a U-kRanks query on an x-Relation $\mathcal{X}$ is a vector
$(t^*_1, \cdots, t^*_k)$, where $t^*_j = \arg\max_{t_i} p_{i,j}$ for $j = 1, \cdots, k$. Let
$tkp(t_i)$ be the top-$k$ probability of a tuple $t_i$, which is the marginal probability that
$t_i$ is ranked top-$k$ in the possible worlds [18].

$$tkp(t_i) = \sum_{I \in pwd(\mathcal{X}),\ t_i \in topk(I)} \Pr(I) = \sum_{j=1}^{k} p_{i,j} \qquad (2)$$

where $t_i \in topk(I)$ means that the tuple $t_i$ is ranked as one of the top-$k$ tuples in the
instance $I$. The answer to a Global-Topk query on an x-Relation $\mathcal{X}$ is a set of size
$k$, $\{t^*_1, \cdots, t^*_k\}$, which satisfies $tkp(t^*_j) \ge tkp(t)$ for any $j = 1, \cdots, k$
and $t \notin \{t^*_1, \cdots, t^*_k\}$.

Example 2.3: The U-2Ranks query on Fig. 1(b) returns 2 tuples, $t_2$ ($score(t_2) = 90$) and $t_3$
($score(t_3) = 80$), because $t_2$ is ranked top and $t_3$ is ranked 2nd. The probability for $t_2$
to be ranked top is $p_{2,1} = 0.04 + 0.16 + 0.1 + 0.4 = 0.7$ and the probability for $t_3$ to be
ranked 2nd is $p_{3,2} = 0.1 + 0.4 = 0.5$. The tuple $t_1$ has the highest score 100 but a low
probability 0.3; therefore, it is not a result in U-2Ranks. The Global Top-2 query returns a set
of 2 tuples $\{t_2, t_3\}$. Here $tkp(t_2) = 0.04 + 0.16 + 0.06 + 0.24 + 0.10 + 0.40 = 1.0$, because
$t_2$ is ranked as a top-2 tuple in every instance, and $tkp(t_3) = 0.10 + 0.40 = 0.5$, because
$t_3$ is ranked as a top-2 tuple only in two instances. Note that the results of U-kRanks and
Global-Topk are not necessarily the same. □
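
The numbers in Example 2.3 can be reproduced by brute force directly from Eq. (1) and Eq. (2).
The following Python sketch is only a check for small inputs, not the algorithm proposed in this
paper. As before, $p(t_2) = 1.0$ and $p(t_4) = 0.8$ are inferred from the quoted world
probabilities; $score(t_4) = 70$ is an assumption (the text only implies that it is below 80),
and the function name `rank_probabilities` is hypothetical.

```python
from itertools import product

SCORES = {"t1": 100, "t2": 90, "t3": 80, "t4": 70}  # score(t4) is assumed
X_TUPLES = [
    [("t1", 0.3), ("t3", 0.5)],  # tau_1
    [("t2", 1.0)],               # tau_2 (probability inferred)
    [("t4", 0.8)],               # tau_3 (probability inferred)
]

def rank_probabilities(x_tuples, scores, k):
    """Brute-force p[t][j-1] = Pr(t is ranked j-th), for j = 1..k (Eq. (1))."""
    p = {t: [0.0] * k for t in scores}
    options = [alts + [(None, 1.0 - sum(q for _, q in alts))] for alts in x_tuples]
    for combo in product(*options):
        prob = 1.0
        for _, q in combo:
            prob *= q
        # Tuples present in this world, highest score first.
        ranked = sorted((t for t, _ in combo if t is not None),
                        key=lambda t: -scores[t])
        for j, t in enumerate(ranked[:k]):
            p[t][j] += prob
    return p

k = 2
p = rank_probabilities(X_TUPLES, SCORES, k)
# U-kRanks: for each rank j, the tuple with the largest p_{i,j}.
u_kranks = [max(p, key=lambda t: p[t][j]) for j in range(k)]
# Global-Topk: the k tuples with the largest top-k probability (Eq. (2)).
tkp = {t: sum(p[t]) for t in p}
global_topk = sorted(tkp, key=tkp.get, reverse=True)[:k]
print(u_kranks, global_topk)
```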

It is important to note that all these probabilistic ranking queries, namely, U-kRanks,
Global-Topk, PT-k, and Pk-topk, need to compute the $p_{i,j}$ values for all $t_i \in \mathcal{X}$
and $j = 1, \cdots, k$, and computing $p_{i,j}$ is the dominant cost in such probabilistic ranking
queries.

3 $p_{i,j}$ Computing

We discuss $p_{i,j}$ computing for a given $k$ and an x-Relation
$\mathcal{X} = \{t_1, \cdots, t_n\}$ sorted in the descending score order. For simplicity and
without loss of generality, in the following discussions, we further assume there are no tie
scores in $\mathcal{X}$, such that $score(t_i) \ne score(t_j)$ for any $t_i \ne t_j$ in
$\mathcal{X}$. Note that all algorithms, including our algorithm to be discussed, can deal with
tie scores with minor modification for computing $p_{i,j}$.

[17] showed that the time complexity of computing $p_{i,j}$ for all $t_i \in \mathcal{X}$ and
$j = 1, \cdots, k$ is $O(kn^2)$. We introduce it in brief below.

Consider an x-Relation $\mathcal{X} = \{t_1, \cdots, t_n\}$ sorted in the decreasing score order.
Let $\mathcal{X}_i = \{t_1, \cdots, t_i\}$ denote a reduced x-Relation on the largest $i$ tuples,
together with the projected (exclusive/independent) relationship between tuples. It is obvious
that $p_{i,j}$ is the same whether it is computed on $\mathcal{X}$ or on $\mathcal{X}_i$, under the
x-Relation model. Formally, let $\Pr(\tau|\mathcal{X}_i)$ be the existence probability of an
x-tuple $\tau$ with respect to $\mathcal{X}_i$ as follows.

$$\Pr(\tau|\mathcal{X}_i) = \sum_{t \in \tau,\ t \in \mathcal{X}_i} p(t) \qquad (3)$$

Then, $\Pr(\tau) = \Pr(\tau|\mathcal{X})$.

We highlight the main idea of computing $p_{i,j}$ in $O(kn^2)$ [17] below. First, consider a
special case, where every x-tuple contains only one tuple (single-alternative), or equivalently,
all the tuples are independent. Then, $p_{i,j}$ is equal to the probability that a randomly
generated possible world from $\mathcal{X}_i$ contains $t_i$ and there are $j$ tuples in total. In
other words, $p_{i,j}$ is the sum of the probabilities of the possible worlds that contain $t_i$
and in which exactly $j - 1$ tuples are taken from the set
$\mathcal{X}_{i-1} = \{t_1, \cdots, t_{i-1}\}$. Let $r_{i,j}$ denote the probability that a randomly
generated possible world from $\mathcal{X}_i$ has exactly $j$ tuples; then
$p_{i,j} = p(t_i) \cdot r_{i-1,j-1}$. For the totally independent case, the set of all $r_{i,j}$
values can be computed efficiently by the following dynamic programming equation, in time
complexity $O(kn)$.

$$r_{i,j} = \begin{cases}
p(t_i) \cdot r_{i-1,j-1} + (1 - p(t_i)) \cdot r_{i-1,j} & \text{if } i \ge j > 0 \\
(1 - p(t_i)) \cdot r_{i-1,j} & \text{if } i > j = 0 \\
1 & \text{if } i = j = 0 \\
0 & \text{otherwise}
\end{cases} \qquad (4)$$
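
For the single-alternative case, the dynamic programming recurrence for $r_{i,j}$ and the
relation $p_{i,j} = p(t_i) \cdot r_{i-1,j-1}$ can be sketched as follows. This is a minimal
sketch that assumes all tuples are independent and are given in descending score order; the
function names are hypothetical. Only the counts $j = 0, \cdots, k-1$ are tracked, since larger
counts are never needed.

```python
def count_distribution(probs, k):
    """rows[i][j] = Pr(exactly j of the first i independent tuples appear), j < k."""
    r = [1.0] + [0.0] * (k - 1)          # r_{0,0} = 1 and r_{0,j} = 0 for j > 0
    rows = [r]
    for p in probs:
        nxt = [(1.0 - p) * r[0]]         # no tuple appears so far
        for j in range(1, k):
            # t_i appears (count goes from j-1 to j) or is absent (count stays j).
            nxt.append(p * r[j - 1] + (1.0 - p) * r[j])
        rows.append(nxt)
        r = nxt
    return rows

def rank_probs_independent(probs, k):
    """result[i][j-1] = p_{i+1,j} for independent tuples (0-based list index i)."""
    rows = count_distribution(probs, k)
    return [[p * rows[i][j] for j in range(k)] for i, p in enumerate(probs)]

# The first three tuples of Example 3.1 belong to different x-tuples,
# so they behave as independent tuples.
print(rank_probs_independent([0.3, 0.5, 0.5], k=2))
```

Each tuple costs $O(k)$ work, which gives the $O(kn)$ bound for the independent case.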



Second, consider the case where some x-tuples may contain multiple tuples (multi-alternative).
The noticeable difference is that $p_{i,j} \ne p(t_i) \cdot r_{i-1,j-1}$ in the multi-alternative
case, because an x-tuple contains multiple alternatives that are mutually exclusive. When
$p_{i,j}$ is computed for a tuple $t_i$, the x-tuple that contains $t_i$ may have other alternatives
that have been computed already. The algorithm needs to remember whether an alternative of an
x-tuple has already been computed in $\mathcal{X}_{i-1}$, using a set denoted $\mathcal{S}$. Let
$\mathcal{S} = \{\tau_1, \cdots, \tau_s\}$ be the set of x-tuples that have at least one
alternative computed in $\mathcal{X}_{i-1}$ already, with probability
$\Pr(\tau_l|\mathcal{X}_{i-1})$ for $1 \le l \le s$ (refer to Eq. (3)). When $t_i$ appears and the
x-tuple $\tau_x$ that contains $t_i$ has already appeared in $\mathcal{S}$, it computes $p_{i,j}$ as
$p_{i,j} = p(t_i) \cdot r'_{s,j-1}$. Here, $r'_{z,j-1}$, for $1 \le z \le s$ and $1 \le j \le k$,
need to be recomputed based on $\mathcal{S} = \{\tau_1, \cdots, \tau_s\}$ with
$\Pr(\tau_x|\mathcal{X}_{i-1}) = 0$ using Eq. (4), which takes $O(s \cdot k)$ time. In the worst
case, it takes $O(i \cdot k)$ to compute $p_{i,j}$ for a specific $i$. The time complexity to
compute the $p_{i,j}$ values, for $1 \le i \le n$ and $1 \le j \le k$, is $O(kn^2)$.
Table 1: Multi-alternative x-Relation

Example 3.1: Consider an x-Relation, $\mathcal{X}$, in Table 1 with 4 x-tuples,
$\{\tau_1, \tau_2, \tau_3, \tau_4\}$, and 8 tuples $\{t_1, \cdots, t_8\}$. Each x-tuple contains two
tuples (alternatives). We assume $score(t_i) > score(t_j)$ if $i < j$, and give the probability of
each tuple $t_i$, $p(t_i)$, in the corresponding parentheses. For example, $\tau_1$ has two tuples
$t_1$ and $t_4$ where $p(t_1) = 0.3$ and $p(t_4) = 0.4$. Let $k = 2$. We show how to compute
$p_{i,j}$ for all tuples $t_i$, for $1 \le i \le 8$ and $j = 1, 2$.

Let all 8 tuples in $\mathcal{X}$ be sorted in the decreasing score order, and let $\mathcal{S}$
be the set of x-tuples that have alternatives in $\mathcal{X}_{i-1}$. Initially,
$\mathcal{X}_0 = \emptyset$ and $\mathcal{S} = \emptyset$.

First, consider $t_1$, which is the tuple that has the largest score; $\mathcal{S} = \emptyset$
implies that $t_1$ has no preceding alternatives. Because $r'_{0,0} = 1$ and $r'_{0,1} = 0$, we have
$p_{1,1} = p(t_1) \cdot r'_{0,0} = 0.3$ and $p_{1,2} = p(t_1) \cdot r'_{0,1} = 0$.
$\mathcal{X}_1 = \{t_1\}$. Based on Eq. (3), the current existence probability of $\tau_1$ in
$\mathcal{X}_1$ is $\Pr(\tau_1|\mathcal{X}_1) = p(t_1) = 0.3$. $\mathcal{S}$ is updated to be
$\mathcal{S} = \{\tau_1\}$, because the x-tuple $\tau_1$ contains $t_1$ that has been computed. For
simplicity, we use $\mathcal{S} = \{\tau_1(0.3)\}$ to indicate that $\mathcal{S}$ contains $\tau_1$
whose current existence probability is 0.3.

Second, consider the second largest score tuple $t_2$, which has no preceding alternatives
computed, because the x-tuple $\tau_2$ that contains $t_2$ does not appear in
$\mathcal{S} = \{\tau_1(0.3)\}$. Because $r'_{1,0} = 0.7$ and $r'_{1,1} = 0.3$, we have
$p_{2,1} = p(t_2) \cdot r'_{1,0} = 0.35$ and $p_{2,2} = 0.15$. $\mathcal{X}_2 = \{t_1, t_2\}$. Based
on Eq. (3), the current existence probability of $\tau_2$ in $\mathcal{X}_2$ is
$\Pr(\tau_2|\mathcal{X}_2) = p(t_2) = 0.5$. $\mathcal{S} = \{\tau_1(0.3), \tau_2(0.5)\}$.

In a similar fashion, the third largest score tuple $t_3$ is computed, which has no preceding
alternatives in $\mathcal{S}$. Because $r'_{2,0} = 0.35$ and $r'_{2,1} = 0.5$, we have
$p_{3,1} = 0.5 \cdot 0.35 = 0.175$ and $p_{3,2} = 0.25$. $\mathcal{X}_3 = \{t_1, t_2, t_3\}$. Based
on Eq. (3), the current existence probability of $\tau_3$ in $\mathcal{X}_3$ is
$\Pr(\tau_3|\mathcal{X}_3) = p(t_3) = 0.5$.
$\mathcal{S} = \{\tau_1(0.3), \tau_2(0.5), \tau_3(0.5)\}$.

Fourth, consider the fourth largest score tuple $t_4$. Note that the current
$\mathcal{S} = \{\tau_1(0.3), \tau_2(0.5), \tau_3(0.5)\}$. Because tuple $t_4$ has a preceding
alternative $t_1$ in x-tuple $\tau_1$, which appears in $\mathcal{S}$ already, the existence
probability is reset to $\Pr(\tau_1|\mathcal{X}_3) = 0$. Therefore, $\mathcal{S}$ is updated to be
$\mathcal{S} = \{\tau_1(0), \tau_2(0.5), \tau_3(0.5)\}$. In order to compute $r'_{3,0}$ and
$r'_{3,1}$, all the $r'_{i,j}$ values, for $i = 1, 2$ and $j = 0, 1$, need to be recomputed as well
based on the updated $\mathcal{S}$. Because $r'_{1,0} = 1$, $r'_{1,1} = 0$, $r'_{2,0} = 0.5$,
$r'_{2,1} = 0.5$, $r'_{3,0} = 0.25$, and $r'_{3,1} = 0.5$, we have
$p_{4,1} = p(t_4) \cdot r'_{3,0} = 0.1$ and $p_{4,2} = 0.2$.
$\mathcal{X}_4 = \{t_1, t_2, t_3, t_4\}$. Based on Eq. (3), the current existence probability of
$\tau_1$ in $\mathcal{X}_4$ is $\Pr(\tau_1|\mathcal{X}_4) = p(t_1) + p(t_4) = 0.3 + 0.4 = 0.7$.
Therefore, $\mathcal{S} = \{\tau_1(0.7), \tau_2(0.5), \tau_3(0.5)\}$, which will be used in the next
iteration.

The same procedure repeats until all $p_{i,j}$ for all $t_i \in \mathcal{X}$ and $j = 1, 2$ are
computed. □
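
The recomputation-based procedure walked through in Example 3.1 can be sketched as below. This is
a simplified illustration of the $O(kn^2)$ idea (rebuilding the $r'$ values from scratch for every
tuple), not the exact implementation of [17]. Only the first four tuples of Table 1 are used,
because the text states only their probabilities; the function name and the x-tuple identifiers
are hypothetical.

```python
def baseline_rank_probs(tuples, k):
    """tuples: list of (x_tuple_id, probability) in descending score order.

    Returns result[i][j-1] = p_{i+1,j}. For every tuple, r' is rebuilt over all
    x-tuples seen so far, which costs O(s * k) per tuple.
    """
    seen = {}                      # S: x-tuple id -> Pr(tau | X_{i-1})
    result = []
    for x_id, p_t in tuples:
        r = [1.0] + [0.0] * (k - 1)
        for other, q in seen.items():
            if other == x_id:
                continue           # same effect as resetting Pr(tau_x | X_{i-1}) to 0
            # One step of the dynamic programming recurrence of Eq. (4).
            r = [(1.0 - q) * r[0]] + [q * r[j - 1] + (1.0 - q) * r[j]
                                       for j in range(1, k)]
        result.append([p_t * r[j] for j in range(k)])
        seen[x_id] = seen.get(x_id, 0.0) + p_t   # Eq. (3)
    return result

# t1 and t4 belong to tau1; t2 and t3 belong to tau2 and tau3 respectively.
first_four = [("tau1", 0.3), ("tau2", 0.5), ("tau3", 0.5), ("tau1", 0.4)]
print(baseline_rank_probs(first_four, k=2))
```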

Note that, between consecutive computations of $p_{i,j}$ and $p_{i+1,j}$, some of the $r'_{i,j}$
computing cost can be shared [7, 17]. [7] also studied several heuristics to compute $p_{i,j}$
faster, but in the worst case the cost is still $O(kn^2)$.

4 A Novel Algorithm

In this paper, we propose a novel $O(kn)$ algorithm that computes $p_{i,j}$ quickly using a newly
introduced conditional probability $c_{i,j}$, given below.

$$c_{i,j} = \Pr(\text{Exactly } j \text{ tuples appear in } \{t_1, \cdots, t_i\} \mid t_{i+1} \text{ appears}) \qquad (5)$$

Consider a general multi-alternative case. Let $\mathcal{X}_i = \{t_1, \cdots, t_i\}$ be the set
computed already. Now, we consider $t_{i+1}$, and assume $t_{i+1}$ appears. Among the tuples
computed already in $\mathcal{X}_i$, there may exist several tuples that are contained in the same
x-tuple that contains $t_{i+1}$. Those tuples need to be removed in order to compute for $t_{i+1}$,
as we discussed in the previous section, by setting the existence probability to be zero. Eq. (5)
is the conditional probability of having exactly $j$ tuples in
$\mathcal{X}_i = \{t_1, \cdots, t_i\}$ after removing those tuples in $\mathcal{X}_i$ that are
contained in the same x-tuple that contains $t_{i+1}$, given $t_{i+1}$ appears. It is interesting
to note that

$$p_{i,j} = \Pr(t_i \text{ appears}) \cdot \Pr(\text{Exactly } j-1 \text{ tuples appear in } \{t_1, \cdots, t_{i-1}\} \mid t_i \text{ appears}) = p(t_i) \cdot c_{i-1,j-1} \qquad (6)$$

The problem then becomes how to compute $c_{i,j}$ efficiently. Note that there is no obvious
relationship between $c_{i,j}$ and $c_{i-1,j}$ (refer to Eq. (4)). However, we observe that there is
a similar relationship between $c_{i,j}$ and $r_{i,j}$. Let $\tau_x$ be the x-tuple that contains
$t_{i+1}$. Then, the relationship between $c_{i,j}$ and $r_{i,j}$ is as follows,

$$r_{i,j} = \begin{cases}
\Pr(\tau_x|\mathcal{X}_i) \cdot c_{i,j-1} + (1 - \Pr(\tau_x|\mathcal{X}_i)) \cdot c_{i,j} & \text{if } j > 0 \\
(1 - \Pr(\tau_x|\mathcal{X}_i)) \cdot c_{i,0} & \text{if } j = 0
\end{cases} \qquad (7)$$

Lemma 4.1: Eq. (7) correctly computes $r_{i,j}$, given $c_{i,j}$. □

Proof Sketch: Assume that $c_{i,j}$ for $0 \le j \le k - 1$ are correct as defined, that is, the
probability that a randomly generated possible world has exactly $j$ tuples from $\mathcal{X}_i$,
conditioned on the appearance of $t_{i+1}$. Let $\tau_x$ be the x-tuple that has $t_{i+1}$, and let
$p$ denote $\Pr(\tau_x|\mathcal{X}_i)$. There are two cases.

First, $t_{i+1}$ has no preceding alternative, equivalently $p = 0$. Then the two parts in the
conditional probability $c_{i,j}$ are independent, so
$c_{i,j} = \Pr(\text{Exactly } j \text{ tuples appear in } \{t_1, \cdots, t_i\})$, where the latter
part of the equation is actually $r_{i,j}$. Hence, Eq. (7) correctly computes $r_{i,j}$, given that
$c_{i,j}$ are correct.

Second, $t_{i+1}$ has some preceding alternatives, equivalently $p > 0$. Assume that
$\mathcal{S} = \{\tau_1, \cdots, \tau_s, \tau_x\}$ is the set of x-tuples that have alternatives
appearing in $\mathcal{X}_i = \{t_1, \cdots, t_i\}$, where $\Pr(\tau_l|\mathcal{X}_i) > 0$ for all
$\tau_l \in \mathcal{S}$. Then $c_{i,j}$ is the probability that a randomly generated possible world
from $\{\tau_1, \cdots, \tau_s\}$ ($= \mathcal{S} \setminus \{\tau_x\}$) has exactly $j$ x-tuples,
and $r_{i,j}$ is the probability that a randomly generated possible world from
$\{\tau_1, \cdots, \tau_s, \tau_x\}$ has exactly $j$ x-tuples. Hence, Eq. (7) is correct based on
the same idea shown in Eq. (4). □

Given $c_{i,j}$ we can compute $r_{i,j}$ using Eq. (7). The reverse also holds: given $r_{i,j}$, we
can compute $c_{i,j}$ correctly by solving the system of linear equations defined in Eq. (7). A
general system of linear equations with $n$ equations and $n$ variables needs time $O(n^3)$. But
the system of linear equations defined by Eq. (7) has a special form: only two diagonals of the
coefficient matrix are non-zero, so it can be solved in $O(n)$ time [9]. In our problem, there are
$k$ linear equations with $k$ variables, which can be solved in time $O(k)$, using
$c_{i,0} = r_{i,0}/(1 - p)$ and $c_{i,j} = (r_{i,j} - p \cdot c_{i,j-1})/(1 - p)$, where
$p = \Pr(\tau_x|\mathcal{X}_i)$, for $1 \le j \le k - 1$. Note that
$0 \le \Pr(\tau_x|\mathcal{X}_i) < 1$. In addition, given $c_{i,j}$, $r_{i+1,j}$ can also be
computed using Eq. (7), by replacing $\Pr(\tau_x|\mathcal{X}_i)$ with
$\Pr(\tau_x|\mathcal{X}_{i+1})$, where $\tau_x$ is the x-tuple that contains $t_{i+1}$.

The algorithm to compute the $r_{i,j}$ and $p_{i,j}$ values for a tuple $t_i$ is shown in
Algorithm 1. It takes three inputs, namely, the tuple $t_i$, the $r_{i-1,j}$ values, and a set of
x-tuples, $\mathcal{S} = \{\tau_1, \cdots, \tau_s\}$, that have been computed, with their
probability $\Pr(\tau_l) = \Pr(\tau_l|\mathcal{X}_{i-1})$. It first computes $p$ (line 1-2). Then,
it computes the $c_{i-1,j}$ values by solving the system of linear equations defined by Eq. (7)
(line 3-5), and computes the $p_{i,j}$ values (line 6). In line 7-10, it computes the $r_{i,j}$
values using Eq. (7). Finally, it updates the probability $\Pr(\tau_x)$ (line 11-14). Note that,
in our algorithm, the only values needed to compute the $p_{i,j}$ values are the $r_{i-1,j}$ values
and $\Pr(\tau_x|\mathcal{X}_{i-1})$.

Theorem 4.1: Algorithm 1 correctly computes the $p_{i,j}$ values with time complexity of
$O(k)$. □
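
To make the special form of the linear system concrete, Eq. (7) can be written in matrix form for
a fixed $i$, with $p = \Pr(\tau_x|\mathcal{X}_i)$. The block below is a worked restatement under
that notation, not part of the original derivation; the numeric values are those of the fourth
step of Example 4.1 ($k = 2$, $p = 0.3$).

```latex
% Eq. (7) as a lower bidiagonal system: only the main diagonal (1-p)
% and the first sub-diagonal (p) are non-zero.
\[
\begin{pmatrix} r_{i,0} \\ r_{i,1} \\ \vdots \\ r_{i,k-1} \end{pmatrix}
=
\begin{pmatrix}
1-p &        &        &     \\
p   & 1-p    &        &     \\
    & \ddots & \ddots &     \\
    &        & p      & 1-p
\end{pmatrix}
\begin{pmatrix} c_{i,0} \\ c_{i,1} \\ \vdots \\ c_{i,k-1} \end{pmatrix}
\]
% Forward substitution, one row at a time, with r_{3,0} = 0.175 and r_{3,1} = 0.425:
\[
c_{3,0} = \frac{r_{3,0}}{1-p} = \frac{0.175}{0.7} = 0.25, \qquad
c_{3,1} = \frac{r_{3,1} - p\,c_{3,0}}{1-p} = \frac{0.425 - 0.075}{0.7} = 0.5 .
\]
```

Each unknown depends only on the previous one, which is why the $k$ unknowns are obtained in
$O(k)$ time.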



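Algorithm 1 CondProb($\mathcal{S}$, $\mathcal{R}_{i-1}$, $t_i$)

Input: the probability for x-tuples $\mathcal{S} = \{\tau_1(\Pr(\tau_1)), \cdots, \tau_s(\Pr(\tau_s))\}$,
       $\mathcal{R}_{i-1} = \{r_{i-1,0}, \cdots, r_{i-1,k-1}\}$ and a tuple $t_i$.
Output: $r_{i,j-1}$ and $p_{i,j}$, for $1 \le j \le k$.

 1: Let $\tau_x$ be the x-tuple that has $t_i$;
 2: $p \leftarrow \Pr(\tau_x)$ if $\tau_x(\Pr(\tau_x))$ appears in $\mathcal{S}$, otherwise 0;
    // compute $c_{i-1,j}$ and $p_{i,j}$ for $0 \le j \le k - 1$
 3: $c_{i-1,0} \leftarrow r_{i-1,0}/(1 - p)$;
 4: for $j \leftarrow 1$ to $k - 1$ do
 5:     $c_{i-1,j} \leftarrow (r_{i-1,j} - p \cdot c_{i-1,j-1})/(1 - p)$;
 6: $p_{i,j} \leftarrow p(t_i) \cdot c_{i-1,j-1}$, for $1 \le j \le k$;
    // compute $r_{i,j}$ for $0 \le j \le k - 1$
 7: $p \leftarrow p + p(t_i)$;
 8: $r_{i,0} \leftarrow (1 - p) \cdot c_{i-1,0}$;
 9: for $j \leftarrow 1$ to $k - 1$ do
10:     $r_{i,j} \leftarrow (1 - p) \cdot c_{i-1,j} + p \cdot c_{i-1,j-1}$;
11: if $\tau_x \notin \mathcal{S}$ then
12:     $\mathcal{S} \leftarrow \mathcal{S} \cup \{\tau_x(p)\}$;
13: else
14:     update $\mathcal{S}$ by changing $\Pr(\tau_x)$ to be $p$;
15: return ($\mathcal{S}$, $\{r_{i,0}, \cdots, r_{i,k-1}\}$, $\{p_{i,1}, \cdots, p_{i,k}\}$);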

Proof Sketch: It is obvious from the discussions above. □

In order to compute all $p_{i,j}$, we enumerate all $t_i \in \mathcal{X}$, which is sorted in the
descending score order, such that $score(t_i) > score(t_j)$ if $i < j$, as given below.

 1: Let $\mathcal{S} = \emptyset$;
 2: Let $\mathcal{R}_0 = \{r_{0,0}, r_{0,1}, \cdots, r_{0,k-1}\}$ where $r_{0,j}$, for $0 \le j \le k - 1$, are computed;
 3: for $i = 1$ to $n$ do
 4:     ($\mathcal{S}$, $\mathcal{R}_i$, $\mathcal{P}_i$) $\leftarrow$ CondProb($\mathcal{S}$, $\mathcal{R}_{i-1}$, $t_i$);
 5:     output $\mathcal{P}_i = \{p_{i,1}, p_{i,2}, \cdots, p_{i,k}\}$;

It is obvious that the time complexity to compute all $p_{i,j}$ is $O(kn)$.
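
Algorithm 1 and the enumeration loop above can be transcribed into Python almost line by line.
The following is a sketch under stated assumptions, not the Visual C++ implementation used in the
experiments: the set $\mathcal{S}$ is represented as a dictionary from a (hypothetical) x-tuple
identifier to its current existence probability, and each tuple is given as a pair
(x-tuple identifier, $p(t_i)$). The function names are hypothetical as well.

```python
def cond_prob(seen, r_prev, x_id, p_t):
    """One call of Algorithm 1.

    seen   : dict, x-tuple id -> Pr(tau | X_{i-1}); updated in place
    r_prev : [r_{i-1,0}, ..., r_{i-1,k-1}]
    x_id   : id of the x-tuple that contains t_i
    p_t    : p(t_i)
    Returns (r_cur, p_cur) with r_cur[j] = r_{i,j} and p_cur[j-1] = p_{i,j}.
    """
    k = len(r_prev)
    p = seen.get(x_id, 0.0)                        # lines 1-2
    c = [r_prev[0] / (1.0 - p)]                    # line 3
    for j in range(1, k):                          # lines 4-5
        c.append((r_prev[j] - p * c[j - 1]) / (1.0 - p))
    p_cur = [p_t * c[j] for j in range(k)]         # line 6
    p += p_t                                       # line 7
    r_cur = [(1.0 - p) * c[0]]                     # line 8
    for j in range(1, k):                          # lines 9-10
        r_cur.append((1.0 - p) * c[j] + p * c[j - 1])
    seen[x_id] = p                                 # lines 11-14
    return r_cur, p_cur

def all_rank_probs(tuples, k):
    """Enumerate tuples in descending score order; O(k) work per tuple."""
    seen = {}
    r = [1.0] + [0.0] * (k - 1)                    # R_0
    out = []
    for x_id, p_t in tuples:
        r, p_cur = cond_prob(seen, r, x_id, p_t)
        out.append(p_cur)
    return out, r

# First four tuples of Table 1: t1 and t4 are alternatives of the same x-tuple.
first_four = [("tau1", 0.3), ("tau2", 0.5), ("tau3", 0.5), ("tau1", 0.4)]
probs, r_last = all_rank_probs(first_four, k=2)
print(probs, r_last)
```

One practical caveat of this sketch: the division by $1 - p$ can amplify floating-point rounding
error when $p$ is very close to 1.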

Fig. 2(a) illustrates the existing $O(kn^2)$ approach to compute $r'_{i,j}$ in the stage $i$ based
on the stage $i - 1$. Note that the stage $i$ is the $i$-th iteration, which computes for the
$i$-th largest score tuple in $\mathcal{X}_i$. On the left side in the stage $i - 1$ and the stage
$i$, it indicates that some x-tuple (marked by •) contains several tuples (alternatives). On the
other hand, Fig. 2(b) illustrates our $O(kn)$ approach to compute $r_{i,j}$, using $c_{i-1,j}$, in
the stage $i$ based on the stage $i - 1$. The shaded parts in Fig. 2(a) and (b) indicate the
equations that need to be computed, and the difference between the two shaded regions confirms the
significant cost saving of our approach.

[Figure 2, panel (b): Our New $O(kn)$ Approach]

Example 4.1: Consider the example x-Relation in Table 1. We show the steps of our algorithm
to compute $p_{i,j}$. Let $k = 2$. We denote the sequence of x-tuples that have been scanned as
$\mathcal{S}$. Initially, $\mathcal{X}_0 = \emptyset$, $\mathcal{S} = \emptyset$, $r_{0,0} = 1$ and
$r_{0,1} = 0$.

First, consider $t_1$, which is the largest score tuple. It has no preceding alternatives, so
$\Pr(\tau_1) = 0$, $c_{0,0} = 1$ and $c_{0,1} = 0$. Then, $p_{1,1} = p(t_1) \cdot c_{0,0} = 0.3$ and
$p_{1,2} = 0$. After computing $t_1$, $\mathcal{X}_1 = \{t_1\}$, $\mathcal{S} = \{\tau_1(0.3)\}$,
and we have $r_{1,0} = (1 - \Pr(\tau_1)) \cdot c_{0,0} = 0.7$ and
$r_{1,1} = (1 - \Pr(\tau_1)) \cdot c_{0,1} + \Pr(\tau_1) \cdot c_{0,0} = 0.3$.

The second largest score tuple $t_2$ has no preceding alternatives, so $\Pr(\tau_2) = 0$,
$c_{1,0} = r_{1,0} = 0.7$ and $c_{1,1} = 0.3$. Then, $p_{2,1} = p(t_2) \cdot c_{1,0} = 0.35$ and
$p_{2,2} = 0.15$. After computing $t_2$, $\mathcal{X}_2 = \{t_1, t_2\}$,
$\mathcal{S} = \{\tau_1(0.3), \tau_2(0.5)\}$, and in addition we have $r_{2,0} = 0.35$ and
$r_{2,1} = 0.5$.

The third largest score tuple $t_3$ has no preceding alternatives, so $\Pr(\tau_3) = 0$,
$c_{2,0} = 0.35$ and $c_{2,1} = 0.5$. Then, $p_{3,1} = p(t_3) \cdot c_{2,0} = 0.175$ and
$p_{3,2} = 0.25$. After computing $t_3$, $\mathcal{X}_3 = \{t_1, t_2, t_3\}$,
$\mathcal{S} = \{\tau_1(0.3), \tau_2(0.5), \tau_3(0.5)\}$, and in addition we have
$r_{3,0} = 0.175$ and $r_{3,1} = 0.425$.

The fourth largest score tuple $t_4$ has a preceding alternative $t_1$ that is contained in
x-tuple $\tau_1$, which appears in $\mathcal{S}$. Therefore,
$p = \Pr(\tau_1|\mathcal{X}_3) = 0.3$, $c_{3,0} = r_{3,0}/(1 - p) = 0.25$,
$c_{3,1} = (r_{3,1} - p \cdot c_{3,0})/(1 - p) = 0.5$, $p_{4,1} = p(t_4) \cdot c_{3,0} = 0.1$ and
$p_{4,2} = 0.2$. After computing $t_4$, $\mathcal{X}_4 = \{t_1, t_2, t_3, t_4\}$,
$\mathcal{S} = \{\tau_1(0.7), \tau_2(0.5), \tau_3(0.5)\}$, and in addition we have
$r_{4,0} = 0.075$ and $r_{4,1} = 0.325$.

The same procedure repeats until all $p_{i,j}$ for all $t_i \in \mathcal{X}$ and $j = 1, 2$ are
computed. □



5 Top-k Generator

Algorithm 1 returns the set of $p_{i,j}$, which can be used to compute the top-$k$ probability of a
tuple, e.g. $tkp(t_i) = \sum_{j=1}^{k} p_{i,j}$. A naive way to get the top-$k$ result is to first
compute the top-$k$ probabilities for all tuples, and then report the top-$k$ tuples with respect
to the top-$k$ probability. In the following, we will first discuss an upper bound, and then
propose an early stop condition, which avoids retrieving all the tuples.

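Lemma 5.1: Let $\{t_1, \cdots, t_i, \cdots\}$ be the order in which we scan the tuples, or
equivalently the decreasing score order, and let $r_{i,j}$ be defined as above. Then
$tkp(t_{i+1}) \le \sum_{j=0}^{k-1} r_{i,j}$, for $i \ge 1$. This upper bound is also tight for an
arbitrary sequence of tuples. □

Proof Sketch: Let $\tau_x$ be the x-tuple that has $t_{i+1}$, and $p = \Pr(\tau_x|\mathcal{X}_i)$.
Note that $p$ may be zero, or equivalently $t_{i+1}$ has no preceding alternative. By Eq. (7),
summing up the $r_{i,j}$'s gives
$\sum_{j=0}^{k-1} r_{i,j} = \sum_{j=0}^{k-1} c_{i,j} - p \cdot c_{i,k-1}$. We have

$$\begin{aligned}
tkp(t_{i+1}) &= \sum_{j=1}^{k} p_{i+1,j} \\
&= p(t_{i+1}) \cdot \sum_{j=0}^{k-1} c_{i,j} \\
&\le \sum_{j=0}^{k-1} c_{i,j} - p \cdot \sum_{j=0}^{k-1} c_{i,j} \\
&\le \sum_{j=0}^{k-1} c_{i,j} - p \cdot c_{i,k-1} = \sum_{j=0}^{k-1} r_{i,j}
\end{aligned}$$

where the third step holds because $\Pr(\tau_x|\mathcal{X}_{i+1}) = p + p(t_{i+1}) \le 1$, as
$t_{i+1}$ is an alternative of x-tuple $\tau_x$. So $tkp(t_{i+1}) \le \sum_{j=0}^{k-1} r_{i,j}$.
When $p(t_{i+1}) = 1$ and $\Pr(\tau_x|\mathcal{X}_i) = 0$, the above inequalities hold with
equality, and therefore $tkp(t_{i+1}) = \sum_{j=0}^{k-1} r_{i,j}$. Hence this upper bound is
tight. □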

Lemma 5.2: $\sum_{j=0}^{k-1} r_{i,j}$ is in decreasing order, i.e.
$\sum_{j=0}^{k-1} r_{i,j} \ge \sum_{j=0}^{k-1} r_{i+1,j}$, for any $i \ge 1$. □

Proof Sketch: There are two cases: $t_{i+1}$ has preceding alternatives or not.

First, if $t_{i+1}$ does not have preceding alternatives, then $r_{i+1,j}$ can be computed by
Eq. (4). Summing up $r_{i+1,j}$, we have
$\sum_{j=0}^{k-1} r_{i+1,j} = \sum_{j=0}^{k-1} r_{i,j} - p(t_{i+1}) \cdot r_{i,k-1} \le \sum_{j=0}^{k-1} r_{i,j}$.
Second, if $t_{i+1}$ has preceding alternatives, assuming $t_{i+1}$ is in the x-tuple $\tau_x$,
then $\Pr(\tau_x|\mathcal{X}_i) > 0$. Assume that $\{\tau_1, \cdots, \tau_x, \cdots, \tau_s\}$ is
the set of x-tuples that have alternatives in $\{t_1, \cdots, t_i\}$, with probability
$\Pr(\tau|\mathcal{X}_i)$. Then $\{\tau_1, \cdots, \tau_x, \cdots, \tau_s\}$ is also the set of
x-tuples that have alternatives in $\{t_1, \cdots, t_{i+1}\}$, and their probability is
$\Pr(\tau|\mathcal{X}_{i+1})$, with $\Pr(\tau|\mathcal{X}_{i+1}) = \Pr(\tau|\mathcal{X}_i)$ for
every x-tuple $\tau$ except $\tau_x$, which has
$\Pr(\tau_x|\mathcal{X}_{i+1}) > \Pr(\tau_x|\mathcal{X}_i)$. Let $r^*_j$ be the probability that a
randomly generated possible world from
$\{\tau_1, \cdots, \tau_x, \cdots, \tau_s\} \setminus \{\tau_x\}$, with probabilities
$\Pr(\tau|\mathcal{X}_i)$, has exactly $j$ x-tuples. The relationship between $r_{i,j}$ and
$r^*_j$, or between $r_{i+1,j}$ and $r^*_j$, is the same as Eq. (4) or Eq. (7). Then
$\sum_{j=0}^{k-1} r_{i,j} = \sum_{j=0}^{k-1} r^*_j - \Pr(\tau_x|\mathcal{X}_i) \cdot r^*_{k-1}$, and
$\sum_{j=0}^{k-1} r_{i+1,j} = \sum_{j=0}^{k-1} r^*_j - \Pr(\tau_x|\mathcal{X}_{i+1}) \cdot r^*_{k-1}$.
So $\sum_{j=0}^{k-1} r_{i+1,j} \le \sum_{j=0}^{k-1} r_{i,j}$, as
$\Pr(\tau_x|\mathcal{X}_{i+1}) > \Pr(\tau_x|\mathcal{X}_i)$. □

Theorem 5.1: If all the top-$k$ probabilities of the current top-$k$ result, e.g. from the set
$\{t_1, \cdots, t_i\}$, are greater than or equal to $\sum_{j=0}^{k-1} r_{i,j}$, then we can stop,
and guarantee that no potential result in $\{t_{i+1}, \cdots, t_n\}$ can be in the top-$k$
result. □
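
As a numeric check of Lemma 5.1, Lemma 5.2 and Theorem 5.1, the worked arithmetic below uses the
values of Example 4.1 ($k = 2$). It only restates numbers already derived there, with
$tkp(t_i) = p_{i,1} + p_{i,2}$.

```latex
\begin{align*}
\sum_{j=0}^{1} r_{1,j} &= 0.7 + 0.3 = 1.0,     & tkp(t_2) &= 0.35 + 0.15 = 0.5 \le 1.0, \\
\sum_{j=0}^{1} r_{2,j} &= 0.35 + 0.5 = 0.85,   & tkp(t_3) &= 0.175 + 0.25 = 0.425 \le 0.85, \\
\sum_{j=0}^{1} r_{3,j} &= 0.175 + 0.425 = 0.6, & tkp(t_4) &= 0.1 + 0.2 = 0.3 \le 0.6, \\
\sum_{j=0}^{1} r_{4,j} &= 0.075 + 0.325 = 0.4. &&
\end{align*}
```

The bounds $1.0 \ge 0.85 \ge 0.6 \ge 0.4$ do not increase, as stated in Lemma 5.2. After $t_4$,
the current top-2 probabilities, 0.5 for $t_2$ and 0.425 for $t_3$, are both at least 0.4, so by
Theorem 5.1 the scan can stop there, whatever tuples follow $t_4$.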

With Theorem 5.1, we can develop an algorithm to compute the top-$k$ tuples with respect to
their top-$k$ probabilities, which is shown in Algorithm 2. It initializes in line 1-5, and
upBound denotes the upper bound of the top-$k$ probabilities of the remaining tuples (line 5).
While the stop condition is not satisfied (line 6), it retrieves the next largest score tuple
(line 7), computes its top-$k$ probability, inserts it into the top-$k$ set (line 8-10), and
updates the upper bound (line 11). The top-$k$ set is maintained as a min-heap of size $k$, and
top-k[k].tkp (line 6) is the minimum top-$k$ probability in the min-heap. When inserting a new
tuple associated with its top-$k$ probability into the min-heap, if its top-$k$ probability is
smaller than that at the top of the min-heap, we do not need to insert it. Otherwise, we replace
the top tuple of the min-heap with the new tuple and update the heap structure.

Theorem 5.2: Algorithm 2 correctly returns the top-$k$ tuples with the highest top-$k$
probabilities. The top-$k$ generator takes time $O(n(k + \log k))$, where $n$ is the scan depth, or
equivalently the number of calls to Next(). □

Proof Sketch: The correctness directly follows from the above discussions.

The time complexity of $O(n(k + \log k))$ does not take Next() into consideration. The
initialization in line 1-5 takes constant time. Each call of Prob() (Algorithm 1) takes $O(k)$
time, based on Theorem 4.1. Line 9 and line 11 take time $O(k)$. Line 10 takes time $O(\log k)$,
due to the min-heap of size $k$. Line 6-11 are only executed $n$ times, so the total time
complexity is $O(n(k + \log k))$. □



Algorithm 2 Top-k($k$)

Input: an integer $k$, specifying the top-$k$ value.
Output: top-$k$ tuples.

 1: Let $\{\tau_1, \cdots, \tau_m\}$ be the set of all the x-tuples;
 2: Initialize $\Pr(\tau_i) \leftarrow 0$, for $1 \le i \le m$;
 3: top-k $\leftarrow \emptyset$;
 4: Initialize $r_0 = 1$ and $r_j = 0$ for $1 \le j \le k - 1$;
 5: upBound $\leftarrow \sum_{j=0}^{k-1} r_j$;
 6: while top-k[k].tkp < upBound do
 7:     $t \leftarrow$ Next();
 8:     $r_j, p_j \leftarrow$ Prob($\{\Pr(\tau_1), \cdots, \Pr(\tau_m)\}$, $\{r_0, \cdots, r_{k-1}\}$, $t$);
 9:     $tkp(t) \leftarrow \sum_{j=1}^{k} p_j$;
10:     Insert $t$ into top-k;
11:     upBound $\leftarrow \sum_{j=0}^{k-1} r_j$;
12: return top-k;
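
Algorithm 2 can be sketched in Python with the standard `heapq` module as the min-heap. This is a
sketch under stated assumptions rather than the authors' implementation. The helper `cond_prob`
is a compact transcription of Algorithm 1, and an iterator plays the role of Next(). The first
four tuples follow Table 1 as described in Example 3.1; the probabilities and x-tuple membership
of $t_5, \cdots, t_8$ are hypothetical values made up for illustration, because the text does not
state them.

```python
import heapq

def cond_prob(seen, r_prev, x_id, p_t):
    """Compact version of Algorithm 1; returns (r_cur, p_cur)."""
    k, p = len(r_prev), seen.get(x_id, 0.0)
    c = [r_prev[0] / (1.0 - p)]
    for j in range(1, k):
        c.append((r_prev[j] - p * c[j - 1]) / (1.0 - p))
    p_cur = [p_t * cj for cj in c]
    p += p_t
    r_cur = [(1.0 - p) * c[0]] + [(1.0 - p) * c[j] + p * c[j - 1]
                                  for j in range(1, k)]
    seen[x_id] = p
    return r_cur, p_cur

def top_k(stream, k):
    """stream yields (name, x_tuple_id, probability) in descending score order."""
    seen, r = {}, [1.0] + [0.0] * (k - 1)
    heap = []                        # min-heap of (tkp, name), at most k entries
    up_bound = sum(r)
    scanned = 0
    it = iter(stream)
    # Continue while the heap is not full or its minimum is below the bound.
    while len(heap) < k or heap[0][0] < up_bound:
        item = next(it, None)        # plays the role of Next()
        if item is None:
            break
        name, x_id, p_t = item
        scanned += 1
        r, p_cur = cond_prob(seen, r, x_id, p_t)
        tkp = sum(p_cur)
        if len(heap) < k:
            heapq.heappush(heap, (tkp, name))
        elif tkp >= heap[0][0]:
            heapq.heapreplace(heap, (tkp, name))
        up_bound = sum(r)            # Lemma 5.1 bound for all remaining tuples
    return sorted(heap, reverse=True), scanned

stream = [
    ("t1", "tau1", 0.3), ("t2", "tau2", 0.5), ("t3", "tau3", 0.5), ("t4", "tau1", 0.4),
    # Hypothetical values for the remaining tuples (not given in the text):
    ("t5", "tau2", 0.3), ("t6", "tau3", 0.2), ("t7", "tau4", 0.4), ("t8", "tau4", 0.5),
]
result, scanned = top_k(stream, k=2)
print(result, scanned)
```

With these inputs the scan stops after the fourth tuple, since the smallest top-2 probability in
the heap already reaches the upper bound for the remaining tuples.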



Parameter   Range                                 Default
mem-p       0.1, 0.3, 0.5, 0.7, 0.9               0.5
p           0.1, 0.3, 0.5, 0.7, 0.9               0.3
k           200, 400, 600, 800, 1000              200
|rule|      5, 10, 15, 20, 25                     10
#tuple      20000, 40000, 60000, 80000, 100000    20000
#rule       500, 1000, 1500, 2000, 2500           2000

Table 2: Parameters and Default Values

6 Experiment

We have implemented our algorithm in Visual C++. We compare our CondProb algorithm,
denoted CP, for computing $p_{i,j}$, with the heuristics proposed in [7], which are RC (rule-tuple
compression only), RC+AR (RC with aggressive reordering), and RC+LR (RC with lazy
reordering). The heuristics proposed can improve the efficiency, but they are still $O(kn^2)$
algorithms, where $n$ is the number of tuples and $k$ is the top-$k$ value. The executable code and
data generator used in [7] are downloadable¹. We use exactly the same synthetic dataset as used
in [7], which is also included in the package.

The parameters and default values are shown in Table 2. Here, mem-p is the expectation
of the membership probability of tuples, p is the threshold specifying the minimum top-$k$
probability of the result tuples returned, k is the top-$k$ value, |rule| is the average number
of tuples in a rule (x-tuple), #tuple is the total number of tuples, and #rule is the total number
of rules (x-tuples).

The experimental results are shown in Fig. 3. In all figures, the shapes of the curves for the
four algorithms are similar. Our CP algorithm is 3,000 times faster than RC+LR on average,
and 30,000 times faster than RC on average.

Figure 3: Computing $p_{i,j}$

¹ http://www.cs.sfu.ca/~jpei/Software/PTKLib.rar

7 Related work

Uncertain data has received increasing attention recently. Most work represents the uncertainty
as probability values, and such data is also called probabilistic data. Many probabilistic data
models and systems have been proposed, for example, the Trio system [1], the MystiQ system [5],
and the MayBMS system [2].

In the literature, several works study computing the top-$k$ results by the interplay of score
and probability, based on the possible worlds semantics. U-Topk and U-kRanks queries were first
proposed in [14] on a general uncertain data model. [16, 17] improve the performance of the
U-Topk and U-kRanks queries using a dynamic programming approach, under an x-Relation model,
by utilizing the independent and mutually exclusive relationship between tuples. [6, 7] define the
PT-k query, and propose three heuristic approaches to answer PT-k queries. In [16, 17, 6, 7],
to answer a U-kRanks or PT-k query, they all need to compute $p_{i,j}$, the probability that tuple
$t_i$ ranks at the $j$-th position in possible worlds, for $1 \le i \le n$ and $1 \le j \le k$, with
time complexity $O(kn^2)$. [8] adapts the U-Topk/U-kRanks/Global-Topk queries (Global-Topk [18]
is the same as Pk-topk in [8]) to an uncertain stream environment under a sliding-window model,
and designs both space- and time-efficient synopses to continuously monitor the top-$k$ results.
But [8] only considers the single-alternative case, or in other words, all tuples are independent.
[3, 11] also need to compute the $p_{i,j}$ values, running the probabilistic ranking in a
middleware to answer ranking spatial queries on uncertain spatial data. [15] discusses aggregate
queries.

There are also works that find the top-$k$ results based on the probability only. In [13],
Re et al. find the $k$ most probable answers for a given general SQL query. In this scenario,
each answer has a probability instead of a score, which intuitively represents the confidence of
its existence, and ranking is based only on probabilities. They use Monte Carlo simulations to
get the top-$k$ results efficiently, as in general it is #P-complete to get the existence
probability [5]. [12, 10, 4] retrieve $k$ objects from an uncertain spatial database that have the
highest probability to be a skyline point or nearest neighbor.

8 Conclusion

Probabilistic top-$k$ queries based on the interplay of score and probability, under the possible
worlds semantics, have become an important research issue that considers both score and
uncertainty on the same basis. In the x-Relation model, an x-tuple consists of a set of mutually
exclusive tuples to represent a discrete probability distribution of the possible tuples in a
randomly instantiated data. In the literature, many different probabilistic top-$k$ queries have
been proposed. Almost all of them need to compute the probability of a tuple $t_i$ to be ranked at
the $j$-th position across the entire set of possible worlds. We call it $p_{i,j}$ computing. The
cost of computing $p_{i,j}$ is the dominant cost and is known to be $O(kn^2)$, where $n$ is the size
of the dataset. In this paper, we proposed a novel algorithm that computes such probabilities
efficiently based on conditional probability and a system of linear equations. We proved the
correctness of our approach, and showed that the time complexity is $O(kn)$. We confirmed the
efficiency by comparing our approach with the up-to-date heuristics and found that our approach
can be at least 3,000 times faster.